1 Introduction

This report analyses the download statistics of the annotation packages of Bioconductor. See the home of the analysis here.

2 Load data

First we read the latest data from the Bioconductor project. There are two files: one with the download stats from 2009 until today, and another with the download stats of the software packages. We will only use the first one:

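# data.table provides the [i, j, by] syntax; ggplot2 draws the figures
library("data.table")
library("ggplot2")
load("stats.RData") # Creates the data.table `stats`
stats <- stats[Category == "Annotation", ]
stats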
##                                   Package Year Month Nb_of_distinct_IPs
##      1:   BSgenome.Dmelanogaster.UCSC.dm3 2014    08                  1
##      2:   BSgenome.Dmelanogaster.UCSC.dm3 2017    01                 74
##      3:   BSgenome.Dmelanogaster.UCSC.dm3 2017    02                163
##      4:   BSgenome.Dmelanogaster.UCSC.dm3 2017    03                135
##      5:   BSgenome.Dmelanogaster.UCSC.dm3 2017    04                125
##      6:   BSgenome.Dmelanogaster.UCSC.dm3 2017    05                175
##      7:   BSgenome.Dmelanogaster.UCSC.dm3 2017    06                162
##      8:   BSgenome.Dmelanogaster.UCSC.dm3 2016    01                125
##      9:   BSgenome.Dmelanogaster.UCSC.dm3 2016    02                221
##     10:   BSgenome.Dmelanogaster.UCSC.dm3 2016    03                181
##     ---                                                                
## 104967:      MafDb.gnomADex.r2.0.1.hs37d5 2017    05                  7
## 104968:      MafDb.gnomADex.r2.0.1.hs37d5 2017    06                  8
## 104969:  SNPlocs.Hsapiens.dbSNP149.GRCh38 2017    02                  6
## 104970:  SNPlocs.Hsapiens.dbSNP149.GRCh38 2017    03                  3
## 104971:  SNPlocs.Hsapiens.dbSNP149.GRCh38 2017    04                  3
## 104972:  SNPlocs.Hsapiens.dbSNP149.GRCh38 2017    05                 14
## 104973:  SNPlocs.Hsapiens.dbSNP149.GRCh38 2017    06                 15
## 104974: TxDb.Ggallus.UCSC.galGal5.refGene 2017    04                  3
## 104975: TxDb.Ggallus.UCSC.galGal5.refGene 2017    05                 15
## 104976: TxDb.Ggallus.UCSC.galGal5.refGene 2017    06                 11
##         Nb_of_downloads   Category                Date
##      1:               1 Annotation 2014-08-01 02:00:00
##      2:             142 Annotation 2017-01-01 01:00:00
##      3:             285 Annotation 2017-02-01 01:00:00
##      4:             225 Annotation 2017-03-01 01:00:00
##      5:             207 Annotation 2017-04-01 02:00:00
##      6:             259 Annotation 2017-05-01 02:00:00
##      7:             304 Annotation 2017-06-01 02:00:00
##      8:             186 Annotation 2016-01-01 01:00:00
##      9:             297 Annotation 2016-02-01 01:00:00
##     10:             262 Annotation 2016-03-01 01:00:00
##     ---                                               
## 104967:               7 Annotation 2017-05-01 02:00:00
## 104968:               9 Annotation 2017-06-01 02:00:00
## 104969:               6 Annotation 2017-02-01 01:00:00
## 104970:               4 Annotation 2017-03-01 01:00:00
## 104971:               3 Annotation 2017-04-01 02:00:00
## 104972:              18 Annotation 2017-05-01 02:00:00
## 104973:              17 Annotation 2017-06-01 02:00:00
## 104974:               4 Annotation 2017-04-01 02:00:00
## 104975:              18 Annotation 2017-05-01 02:00:00
## 104976:              11 Annotation 2017-06-01 02:00:00

There have been 2571 annotation packages in Bioconductor. Some were added recently and others earlier.
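As a minimal sketch of how such a count can be obtained, the example below filters a toy table by category and counts the distinct packages. The table `toy` and its values are invented for illustration; only the column names follow the data shown above.

```r
library("data.table")

# Toy table with the same column names as the report (values are invented)
toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC"),
  Category = c("Annotation", "Annotation", "Annotation", "Software"),
  Nb_of_downloads = c(10, 12, 3, 50)
)

# Keep one category, then count the distinct packages in it
annotation <- toy[Category == "Annotation", ]
n_packages <- uniqueN(annotation$Package)
n_packages
```

Counting distinct names rather than rows matters because each package contributes one row per month.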

3 Packages

3.1 Number

First we explore the number of packages being downloaded by month:

theme_bw <- theme_bw(base_size = 16)
scal <- scale_x_datetime(date_breaks = "3 months")
ggplot(stats[, .(Downloads = .N), by = Date], aes(Date, Downloads)) +
  geom_bar(stat = "identity") + 
  theme_bw +
  ggtitle("Packages downloaded") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
  scal + 
  xlab("")

Figure 1: Packages in Bioconductor with downloads

The number of packages being downloaded increases almost exponentially with time, which is partially explained by the incorporation of new packages.
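The counting relies on the table holding one row per package and month, so `.N` grouped by `Date` gives the number of packages with at least one download. A small sketch on invented data (the table `toy` is an assumption, not the real statistics):

```r
library("data.table")

toy <- data.table(
  Package = c("pkgA", "pkgB", "pkgA", "pkgB", "pkgC"),
  Date = as.POSIXct(c("2017-01-01", "2017-01-01",
                      "2017-02-01", "2017-02-01", "2017-02-01"), tz = "UTC"),
  Nb_of_downloads = c(5, 3, 7, 2, 1)
)

# One row per package and month, so .N per Date is the number of packages;
# the sum of the same groups gives the total downloads of the month
per_month <- toy[, .(Packages = .N, Total = sum(Nb_of_downloads)), by = Date]
per_month
```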

ggplot(stats[, .(Number = sum(Nb_of_downloads)), by = Date], aes(Date, Number)) +
  geom_bar(stat = "identity") + 
  theme_bw +
  ggtitle("Downloads") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  xlab("")

Figure 2: Downloads of packages

Although the number of packages increases almost exponentially, the number of downloads has grown linearly with time since 2011. This indicates that each package must compete with more and more packages to be downloaded.

pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads), 
                  ymin = mean(Nb_of_downloads)-1.96*sd(Nb_of_downloads)/sqrt(.N),
                  ymax = mean(Nb_of_downloads)+1.96*sd(Nb_of_downloads)/sqrt(.N)), 
              by = Date], aes(Date, Number)) +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width=.1, position=pd) +
  geom_point() + 
  geom_line() +
  theme_bw +
  ggtitle("Downloads") +
  ylab("Mean download for a package") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  xlab("")

Figure 3: Downloads of packages per package. The error bars indicate the 95% confidence interval.

Here we can appreciate that the number of downloads per package hasn’t changed much with time. If anything, there is now less dispersion between the downloads of different packages.
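The error bars come from the normal approximation of the 95% confidence interval of the mean, that is mean ± 1.96 · sd / sqrt(n). The sketch below reproduces that calculation on invented numbers; the table `toy` is hypothetical, and with download counts as skewed as these the approximation is probably rough.

```r
library("data.table")

toy <- data.table(
  Date = rep(c("2017-01", "2017-02"), each = 4),
  Nb_of_downloads = c(10, 12, 14, 16, 100, 2, 3, 5)
)

# Normal-approximation 95% interval for the mean: mean +/- 1.96 * sd / sqrt(n)
ci <- toy[, {
  m <- mean(Nb_of_downloads)
  se <- sd(Nb_of_downloads) / sqrt(.N)
  list(Mean = m, ymin = m - 1.96 * se, ymax = m + 1.96 * se)
}, by = Date]
ci
```

The second month has a wider interval because one package dominates its downloads.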

3.2 Incorporations

This might be due to an increase in the usage of packages or to new packages bringing more users. We start by finding out how many packages have been introduced in Bioconductor each month.

# Current year and month as zero-padded strings, like the Year and Month columns
year <- format(Sys.Date(), "%Y")
month <- format(Sys.Date(), "%m")
# First month in which each package was downloaded
incorporation <- stats[, .SD[which.min(Date)], by = Package, .SDcols = "Date"]
histincorporation <- incorporation[, .(Number = .N), by = Date]
ggplot(histincorporation, aes(Date, Number)) + 
  geom_bar(stat="identity") + 
  theme_bw + 
  ggtitle("Packages with first download") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) +
  xlab("")

Figure 4: New packages

We can see that there were more than 60 packages in Bioconductor before 2009, and since then the number occasionally rises to 10 first downloads in a month (which would be new packages being added).
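The first month with a download is used here as a proxy for the incorporation date. A minimal sketch of the same idea on invented data, using `min(Date)` per package instead of `which.min` (the table `toy` is an assumption):

```r
library("data.table")

toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  Date = as.POSIXct(c("2009-01-01", "2009-02-01",
                      "2009-02-01", "2009-03-01", "2009-02-01"), tz = "UTC")
)

# Earliest month with a download, taken as a proxy for the incorporation date
first_seen <- toy[, .(Date = min(Date)), by = Package]

# Number of packages first seen in each month
new_per_month <- first_seen[, .(Number = .N), by = Date]
new_per_month
```

Packages that already existed when the records start all fall in the first month, which would explain a large first bar.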

ggplot(histincorporation, aes(Date, Number)) + 
  geom_bar(stat="identity") + 
  theme_bw + 
  ggtitle("Packages with first download") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) +
  xlab("") + 
  ylim(c(0, 20))
## Warning: Removed 15 rows containing missing values (position_stack).

Figure 5: New packages

A closer view of the new packages not previously downloaded.

3.3 Removed

Using a similar procedure we can approximate the packages deprecated and removed each month. In this case we look for the last date a package was downloaded, excluding the current month:

deprecation <- stats[, .SD[which.max(Date)], by = Package, .SDcols = c("Date", "Year", "Month")]
# Drop the packages last seen in the current month of the current year
deprecation <- deprecation[!(Year == year & Month == month), ]
histDeprecation <- deprecation[, .(Number = .N), by = Date]
ggplot(histDeprecation, aes(Date, Number)) + 
  geom_bar(stat = "identity") + 
  theme_bw + 
  ggtitle("Packages without downloads") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  ylab("Last seen packages") +
  xlab("")

Figure 6: Date when a package was last downloaded. It approximates the date when packages were removed from Bioconductor.

Here we can see the number of packages whose last download was in a given month, assuming that this means they have been deprecated. A package may no longer be downloaded yet still be in the Bioconductor repository; this would explain the spike to 3000 packages in the last month. In total there are 1128 packages downloaded. We further explore how much time passes between the incorporation of a package and its last download.
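One way to sketch the "last seen" logic is to compare full dates, which avoids handling year and month separately. The data and the fixed `current` date below are invented so that the example is reproducible; they are not taken from the report.

```r
library("data.table")

toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  Date = as.POSIXct(c("2017-04-01", "2017-06-01",
                      "2016-06-01", "2017-03-01", "2017-05-01"), tz = "UTC")
)

# Pretend the report is built in June 2017 (fixed so the example is reproducible)
current <- as.POSIXct("2017-06-01", tz = "UTC")

# Last month with a download for each package
last_seen <- toy[, .(Date = max(Date)), by = Package]

# Packages still downloaded in the current month are not candidates for removal
candidates <- last_seen[Date < current, ]
candidates
```

Note that `pkgB`, last seen in June of an earlier year, stays among the candidates, while `pkgA`, downloaded in the current month, is dropped.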

df <- merge(incorporation, deprecation, by = "Package")
timeBioconductor <- as.numeric(difftime(df$Date.y, df$Date.x, units = "days")) / 365 # Transform to years
hist(timeBioconductor, main = "Time in Bioconductor", xlab = "Years")
abline(v = mean(timeBioconductor), col = "red")
abline(v = median(timeBioconductor), col = "green")

Figure: Time of packages between first and last download

Packages tend to stay up to 10 years. Not surprisingly, the number of packages incorporated before 2009 and still in the repository is 0. But how do the packages that have not been removed do in Bioconductor?
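When subtracting two date-time vectors, R chooses the units of the result automatically, so it is safer to request them explicitly before converting to years. A short sketch with invented dates (the vectors `first` and `last` are hypothetical):

```r
first <- as.POSIXct(c("2009-01-01", "2012-06-01"), tz = "UTC")
last  <- as.POSIXct(c("2017-01-01", "2012-06-01"), tz = "UTC")

# Fixing the units avoids depending on the automatic choice made by `-`
days <- as.numeric(difftime(last, first, units = "days"))

# 365.25 days per year accounts for leap years on average
years <- days / 365.25
years
```

A package seen in a single month gets a lifetime of 0 years, so such packages pile up in the first bar of a histogram.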

4 Packages downloads

4.1 Ratio downloads per IP

We can start by comparing the number of downloads (when different from 0) with the number of distinct IPs that download each package.

ggplot(stats, aes(Nb_of_distinct_IPs, Nb_of_downloads, col = Package)) + 
  geom_point() + 
  theme_bw + 
  geom_smooth(method = "lm") + 
  xlab("Number of distinct IPs") + 
  ylab("Number of downloads") + 
  ggtitle("Downloads by different IP") +
  geom_abline(slope = 2) + 
  guides(col = FALSE)

Figure 7: Downloads and distinct IPs of all months and packages. Each color is a package; the black line represents 2 downloads per IP.

Not surprisingly, most packages have two downloads from the same IP, one for each Bioconductor release (black line). However, for some packages a few IPs download the same package many times, which may indicate that these packages are mostly installed in a few locations.

ratio <- stats[, .(slope = coef(lm(Nb_of_downloads~Nb_of_distinct_IPs))[2]), by = Package]
ratio <- ratio[order(slope, decreasing = TRUE), ]
ratio <- ratio[!is.na(slope), ]
ratio$Package <- as.character(ratio$Package)
ratio
##                                           Package     slope
##    1:               BSgenome.Hsapiens.NCBI.GRCh38 7.3759554
##    2:                                 hgu95aprobe 6.0319987
##    3:                             org.EcK12.eg.db 5.3779110
##    4:                             hs133bptentrezg 5.2747253
##    5:                               pd.hg.u133a.2 5.2467415
##    6:                                hs133phsenst 5.2105263
##    7:                  BSgenome.Celegans.UCSC.ce2 4.6543039
##    8:                                  mirna10cdf 4.5150913
##    9:                             hs133xptenstcdf 4.4647887
##   10:                                hs133xptense 4.4615385
##   ---                                                      
## 2511:               BSgenome.Ggallus.UCSC.galGal5 0.8140669
## 2512:              BSgenome.Mmulatta.UCSC.rheMac8 0.7919463
## 2513: IlluminaHumanMethylation27kanno.ilmn12.hg19 0.7409949
## 2514:                            mta10probeset.db 0.7306502
## 2515:                                hgug4845a.db 0.7126742
## 2516:          BSgenome.Ptroglodytes.UCSC.panTro5 0.6421801
## 2517:              alternativeSplicingEvents.hg19 0.6264259
## 2518:                   mta10transcriptcluster.db 0.4765625
## 2519:                                    AHEnsDbs 0.4473684
## 2520:                                hgfocusprobe 0.3349607

We can see that the package with the most downloads from the same IP is BSgenome.Hsapiens.NCBI.GRCh38, followed by hgu95aprobe, org.EcK12.eg.db and, in fourth place, hs133bptentrezg.
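The slope summarises how many downloads each additional distinct IP brings for a package. The sketch below fits one linear model per package on invented data where the answer is known in advance (the table `toy` is an assumption):

```r
library("data.table")

toy <- data.table(
  Package = rep(c("pkgA", "pkgB"), each = 4),
  Nb_of_distinct_IPs = c(1, 2, 3, 4, 1, 2, 3, 4),
  Nb_of_downloads = c(2, 4, 6, 8, 5, 10, 15, 20)
)

# Slope of downloads against distinct IPs, fitted separately for each package
slopes <- toy[, .(slope = coef(lm(Nb_of_downloads ~ Nb_of_distinct_IPs))[[2]]),
              by = Package]

# Largest slope first
slopes[order(-slope)]
```

Here `pkgA` has exactly 2 downloads per IP and `pkgB` has 5, so the fitted slopes are 2 and 5. Packages with a single month of data yield `NA`, which is why such rows need to be dropped.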

Now we explore whether there are seasonal cycles in the downloads, as the earlier download figures seem to show some cycles.

4.2 By date

First we can explore the number of IPs per month downloading each package:

ggplot(stats, aes(Date, Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("IPs") +
  ylab("Distinct IP downloads") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 8: Distinct IP per package

As we can see, in 2009 there are two groups of packages: some with a low number of IPs and some with a higher number. As time progresses, the number of distinct IPs increases for some packages. But is the spread in IPs associated with an increase in downloads?

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("Downloads per package") +
  ylab("Downloads") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 9: Downloads per year

Surprisingly, some packages have a big burst of up to 400k downloads, others of just 100k downloads. But let's focus on the lower end:

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("Downloads per package every three months") +
  ylab("Downloads") +
  scal +
  ylim(0, 10000)+
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 10: Downloads per year

There are many packages close to 0 downloads each month, but most packages have fewer than 10000 downloads per month:

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw+
  ggtitle("Downloads per package every three months") +
  ylab("Downloads") +
  scal +
  ylim(0, 2500)+
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)
## Warning: Removed 108 rows containing missing values (geom_path).

Figure 11: Downloads per year

As we can see, in general the month of the year also influences the number of downloads. So from 2010 on, the factors influencing the downloads are the year and the month.
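A simple way to look at the effect of the month is to average the downloads of each calendar month over the years. This is only a sketch on invented numbers (the table `toy` is hypothetical), and it does not separate the seasonal effect from the yearly trend.

```r
library("data.table")

toy <- data.table(
  Year = rep(c(2015, 2016), each = 3),
  Month = rep(c("01", "07", "12"), times = 2),
  Nb_of_downloads = c(100, 60, 80, 140, 90, 110)
)

# Average over years for each calendar month to expose a seasonal pattern
by_month <- toy[, .(Mean = mean(Nb_of_downloads)), by = Month]
by_month
```

In this toy data the summer month has the lowest mean, which is the kind of pattern such a summary would reveal.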

Maybe there is a relationship between the downloads and the number of IPs per date:

ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("IPs") +
  ylab("Ratio") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 12: Ratio downloads per IP per package

We can see that some packages have occasional rises in downloads per IP. But restricting the plot to a small range leaves out a lot of packages:

ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("IPs") +
  ylab("Ratio") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE) +
  ylim(1, 5)

Figure 13: Ratio downloads per IP per package

But most of the packages seem to be more or less constant and around 2.
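To put a number on "around 2", one option is to compute the ratio for every package and month and then summarise it per package with the median. The sketch below uses invented data; the table `toy` and the choice of the median are assumptions, not part of the original analysis.

```r
library("data.table")

toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgB"),
  Nb_of_distinct_IPs = c(10, 20, 2, 4),
  Nb_of_downloads = c(20, 38, 30, 40)
)

# Downloads per distinct IP for every package and month
toy[, Ratio := Nb_of_downloads / Nb_of_distinct_IPs]

# Median ratio per package, less sensitive to a single unusual month
med <- toy[, .(Median = median(Ratio)), by = Package]
med
```

A package such as `pkgB`, with few IPs and many downloads, stands out with a much larger ratio than the typical value.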

5 Models

One problem when comparing the evolution of the packages is that they started at different moments and, as seen above, both the number of downloads and the number of packages have increased with time. So we need to normalize the starting dates:

norm <- stats[, .(Norm = as.numeric(Date)/as.numeric(max(Date)), 
                   Downloads = Nb_of_downloads/max(Nb_of_downloads)), by = Package]
ggplot(norm, aes(Norm, Downloads, col = Package)) + 
  geom_line() + 
  theme_bw() + 
  ggtitle("Downloads per stage of the package") +
  xlab("Date normalized") + 
  guides(col = FALSE)

Figure 14: Normalization of dates and downloads

We can observe a tendency for the number of downloads to decrease after a package is included in Bioconductor and to rise again later.
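An alternative normalization, sketched here as an assumption rather than as the method used above, rescales the dates of each package between its first and last download, so that every package runs from 0 to 1 regardless of when it started. The table `toy` is invented, and packages seen in a single month would produce `NaN` and need separate handling.

```r
library("data.table")

toy <- data.table(
  Package = rep(c("old", "new"), each = 3),
  Date = as.POSIXct(c("2009-01-01", "2013-01-01", "2017-01-01",
                      "2016-01-01", "2016-07-01", "2017-01-01"), tz = "UTC"),
  Nb_of_downloads = c(10, 40, 20, 5, 1, 10)
)

# Rescale the dates of each package to [0, 1] and its downloads
# to a fraction of the package's own maximum
rescaled <- toy[, {
  secs <- as.numeric(Date)
  list(Stage = (secs - min(secs)) / (max(secs) - min(secs)),
       Downloads = Nb_of_downloads / max(Nb_of_downloads))
}, by = Package]
rescaled
```

With this rescaling an old and a new package can be compared at the same stage of their life, at the cost of losing the calendar time.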

SessionInfo

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.4 ggplot2_2.2.1     BiocStyle_2.4.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     knitr_1.15.1     magrittr_1.5     munsell_0.4.3   
##  [5] colorspace_1.3-2 stringr_1.2.0    highr_0.6        plyr_1.8.4      
##  [9] tools_3.4.0      grid_3.4.0       gtable_0.2.0     htmltools_0.3.6 
## [13] yaml_2.1.14      lazyeval_0.2.0   rprojroot_1.2    digest_0.6.12   
## [17] tibble_1.3.0     bookdown_0.3     evaluate_0.10    rmarkdown_1.5   
## [21] labeling_0.3     stringi_1.1.5    compiler_3.4.0   scales_0.4.1    
## [25] backports_1.0.5